arXiv:1502.03529v3 [cs.LG] 20 Jul 2015


Scalable Stochastic Alternating Direction Method of Multipliers


Shen-Yi Zhao, Wu-Jun Li, Zhi-Hua Zhou
Department of Computer Science 
National Key Laboratory for Novel Software Technology 
Nanjing University, China 

July 21, 2015 


Abstract 

Most stochastic ADMM (alternating direction method of multipliers)
methods achieve only a convergence rate slower than $O(1/T)$ on general
convex problems, where $T$ is the number of iterations. Hence, these
methods are not scalable in terms of convergence rate (computation cost).
Only one existing stochastic method, called SA-ADMM, achieves a
convergence rate of $O(1/T)$ on general convex problems. However, SA-ADMM
needs extra memory to store the historic gradients on all samples, and
thus it is not scalable in terms of storage cost. In this paper, we
propose a novel method, called scalable stochastic ADMM (SCAS-ADMM), for
large-scale optimization and learning problems. Without the need to store
the historic gradients on all samples, SCAS-ADMM achieves the same
convergence rate of $O(1/T)$ as the best stochastic method SA-ADMM and
batch ADMM on general convex problems. Experiments on graph-guided fused
lasso show that SCAS-ADMM can achieve state-of-the-art performance in
real applications.


1 Introduction 

The alternating direction method of multipliers (ADMM) [1] was proposed to
solve problems that can be formulated as follows:

    \min_{x,y} P(x,y) = f(x) + g(y)                                   (1)
    s.t.  Ax + By = c,

where $f(\cdot)$ and $g(\cdot)$ are convex functions, $A \in \mathbb{R}^{l \times p}$
and $B \in \mathbb{R}^{l \times q}$ are matrices, $c \in \mathbb{R}^{l}$ is a vector,
and $x \in \mathbb{R}^{p}$ and $y \in \mathbb{R}^{q}$ are the variables to be
optimized (learned). By splitting the objective function $P(\cdot)$ into two
parts $f(\cdot)$ and $g(\cdot)$, ADMM provides a flexible framework to handle
many optimization problems. For example, by taking $f(x)$ to be the square
loss or logistic loss on the training set, $g(y)$ to be the $L_1$-norm and
the constraint to be $x - y = 0$, we get the well-known lasso formulation
[2]. Similarly, taking more complex constraints than that in lasso yields
more complex regularization problems such as the structured sparse
regularization problems [?]. Compared with other optimization methods such
as gradient descent, ADMM has demonstrated better performance in many
complex regularization problems [?]. Furthermore, ADMM can be easily
adapted to solve large-scale distributed problems [1]. Hence, ADMM has
been widely used in a large variety of areas [1].

Deterministic (batch) ADMM needs to visit all the samples in each
iteration. Existing works have shown that batch ADMM is not efficient
enough for big data applications with a large number of training samples
[?]. Stochastic (online) ADMM, which visits only one sample or a
mini-batch of samples each time, has recently been shown to achieve better
performance than batch ADMM [?]. Hence, stochastic ADMM has become a hot
research topic and has attracted much attention [?].

Online alternating direction method (OADM) [?] is the first online ADMM
method. OADM comes with a regret analysis only, from which we can find
that if OADM is adapted to stochastic settings with finite samples, its
convergence rate is $O(1/\sqrt{T})$ for general convex problems, where
$f(x)$ and $g(y)$ are convex but not necessarily strongly convex. Here,
$T$ is the number of iterations. Besides OADM, several stochastic ADMM
methods have been proposed, including stochastic ADMM (STOC-ADMM) [?],
regularized dual averaging ADMM (RDA-ADMM) [3], online proximal gradient
descent based ADMM (OPG-ADMM) [3], optimal stochastic ADMM (OS-ADMM) [7],
and stochastic average ADMM (SA-ADMM) [4]. STOC-ADMM, RDA-ADMM, OPG-ADMM
and OS-ADMM achieve a convergence rate of $O(1/\sqrt{T})$ for general
convex problems, worse than batch ADMM, which has a convergence rate of
$O(1/T)$ [?]. Different from STOC-ADMM, RDA-ADMM, OPG-ADMM and OS-ADMM,
SA-ADMM [4] achieves a convergence rate of $O(1/T)$ for general convex
problems by using historic gradients to approximate the full gradients in
each iteration. Thus, SA-ADMM is the only one which is scalable in terms
of convergence rate (computation cost). However, SA-ADMM requires extra
memory, typically very large, to store the historic gradients on all
samples, making it not scalable in terms of storage cost.

In this paper, we propose a novel method, called scalable stochastic ADMM (SCAS-ADMM), 
for large-scale optimization and learning problems. The main contributions of 
SCAS-ADMM are outlined as follows: 

• SCAS-ADMM achieves the same convergence rate of $O(1/T)$ for general
convex problems as the best existing stochastic ADMM method (SA-ADMM)
and batch ADMM. Therefore, SCAS-ADMM is scalable in terms of
convergence rate (computation cost).

• Different from SA-ADMM, SCAS-ADMM does not need extra memory to store
the historic gradients on all samples. Therefore, SCAS-ADMM is scalable
in terms of memory (storage) cost.


• Experimental results on graph-guided fused lasso [5] show that
SCAS-ADMM can achieve state-of-the-art performance in real applications.

2 Background 

2.1 Convex and Smooth Functions 

We use $\|a\|$ to denote the Euclidean ($L_2$) norm of $a$. A function
$h(\cdot)$ is called Lipschitz continuous if: $\exists L_h > 0$,
$\forall a, b$, $\|h(b) - h(a)\| \le L_h \|b - a\|$. Assume $h(\cdot)$ is
differentiable, and let $\nabla h(a)$ denote the gradient of $h(\cdot)$ at
$a$. A function $h(\cdot)$ is called convex if: $\forall a, b$,
$h(b) \ge h(a) + [\nabla h(a)]^T (b - a)$. Assume $h(\cdot)$ is convex and
differentiable. $h(\cdot)$ is called $\nu_h$-smooth if: $\exists \nu_h > 0$,
$\forall a, b$,
$h(b) \le h(a) + [\nabla h(a)]^T (b - a) + \frac{\nu_h}{2} \|b - a\|^2$.
This is equivalent to saying that $\nabla h(\cdot)$ is $\nu_h$-Lipschitz
continuous. Here, $\nu_h$ is called the Lipschitz constant of $h(\cdot)$
(more precisely, of its gradient). A function $h(\cdot)$ is called strongly
convex if: $\exists \mu_h > 0$, $\forall a, b$,
$h(b) \ge h(a) + [\nabla h(a)]^T (b - a) + \frac{\mu_h}{2} \|b - a\|^2$.
A function $h(\cdot)$ is called general convex if $h(\cdot)$ is convex but
not necessarily strongly convex.
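As a quick numerical illustration of these definitions, the sketch below checks the convexity and smoothness inequalities for the scalar logistic loss $h(u) = \log(1 + e^{-u})$ on a small grid of points. The constant $\nu_h = 1/4$ follows from $h''(u) \le 1/4$; the grid and the helper names are illustrative assumptions and do not come from the paper.

```python
import math

def logistic(u):
    """h(u) = log(1 + exp(-u)), the scalar logistic loss."""
    return math.log1p(math.exp(-u))

def logistic_grad(u):
    """h'(u) = -1 / (1 + exp(u))."""
    return -1.0 / (1.0 + math.exp(u))

NU = 0.25  # smoothness constant: h''(u) = sigma(u) * (1 - sigma(u)) <= 1/4

def check_convex_and_smooth(a, b, tol=1e-12):
    """Check h(a) + h'(a)(b-a) <= h(b) <= h(a) + h'(a)(b-a) + (NU/2)(b-a)^2."""
    linear = logistic(a) + logistic_grad(a) * (b - a)
    lower_ok = logistic(b) >= linear - tol                              # convexity
    upper_ok = logistic(b) <= linear + 0.5 * NU * (b - a) ** 2 + tol    # nu-smoothness
    return lower_ok and upper_ok

# Hypothetical test grid; any pair of reals should pass.
points = [-3.0, -1.0, 0.0, 0.5, 2.0, 4.0]
all_ok = all(check_convex_and_smooth(a, b) for a in points for b in points)
```

The same loss is not strongly convex, because $h''(u) \to 0$ as $|u| \to \infty$, so it is only general convex in the sense defined above.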

2.2 ADMM 

ADMM solves (1) based on the augmented Lagrangian function:

    L(x, y, \beta) = f(x) + g(y) + \beta^T (Ax + By - c) + \frac{\rho}{2} \|Ax + By - c\|^2,    (2)

where $\beta$ is a vector of Lagrangian multipliers, and $\rho > 0$ is a
penalty parameter.

Just like the Gauss-Seidel method, ADMM iteratively updates the variables
in an alternating manner as follows [1]:

    x_{t+1} = \arg\min_x L(x, y_t, \beta_t),                           (3)
    y_{t+1} = \arg\min_y L(x_{t+1}, y, \beta_t),                       (4)
    \beta_{t+1} = \beta_t + \rho (A x_{t+1} + B y_{t+1} - c),          (5)

where $x_t$, $y_t$ and $\beta_t$ denote the values of $x$, $y$ and $\beta$
at the $t$-th iteration, respectively.
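To make the updates (3)-(5) concrete, here is a minimal sketch on a toy lasso-type instance of (1) with $f(x) = \frac{1}{2}\|x - a\|^2$, $g(y) = \lambda\|y\|_1$, $A = I$, $B = -I$ and $c = 0$. The problem instance, the function names and the parameter values are assumptions chosen so that both subproblems have closed forms; the known minimizer is the soft-thresholded vector, which the iterates should approach.

```python
def soft_threshold(v, kappa):
    """Proximal operator of kappa * |.| applied to a scalar."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0

def admm_toy_lasso(a, lam, rho=1.0, iters=200):
    """Batch ADMM for min 0.5*||x - a||^2 + lam*||y||_1  s.t.  x - y = 0."""
    p = len(a)
    x, y, beta = [0.0] * p, [0.0] * p, [0.0] * p
    for _ in range(iters):
        # (3): x-update; closed form because f is quadratic and A = I
        x = [(a[j] - beta[j] + rho * y[j]) / (1.0 + rho) for j in range(p)]
        # (4): y-update; soft-thresholding because g is the L1 norm and B = -I
        y = [soft_threshold(x[j] + beta[j] / rho, lam / rho) for j in range(p)]
        # (5): multiplier update with the constraint residual x - y
        beta = [beta[j] + rho * (x[j] - y[j]) for j in range(p)]
    return x, y, beta

# Hypothetical input; the exact solution is soft_threshold(a_j, lam) per coordinate.
x_hat, y_hat, beta_hat = admm_toy_lasso([3.0, -0.5, 1.2], lam=1.0)
```

In this separable example every iteration touches all coordinates of $a$, which mirrors how batch ADMM touches all samples in each iteration.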

In the regularized risk minimization problem which this paper focuses on,
the function $f(x)$ usually has the following structure:

    f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),                          (6)

where $x$ denotes the model parameter, $n$ is the number of training
samples, and each $f_i(\cdot)$ is the empirical loss caused by the $i$-th
sample. The function $g(y)$ is usually a regularization term. For example,
$f_i(x) = \log(1 + \exp(-b_i a_i^T x))$ in logistic regression (LR), and
$f_i(x) = (b_i - a_i^T x)^2$ in least squares, where $(a_i, b_i)$ is the
$i$-th training sample with the class label $b_i$. Taking
$g(y) = \|y\|_1$ and the constraint $y = x$, we get the lasso formulation
[5]. Similarly, we can get more complex regularization problems by taking
more complex constraints like $y = Ax$.

Unless otherwise stated, $f(x)$ of the problem we are trying to solve in
this paper is defined in (6). Then (3) becomes:

    x_{t+1} = \arg\min_x \{ \frac{1}{n} \sum_{i=1}^{n} f_i(x) + \beta_t^T (Ax + B y_t - c) + \frac{\rho}{2} \|Ax + B y_t - c\|^2 \}.    (7)

From (7), it is easy to see that ADMM needs to visit all the $n$ samples
in each iteration. Hence, this version of ADMM is also called batch ADMM
or deterministic ADMM. Some works [?] have proved that the above batch
ADMM has a convergence rate of $O(1/T)$ for general convex problems, where
$f(x)$ and $g(y)$ are convex but not necessarily strongly convex, and $T$
is the number of iterations.

Different from batch ADMM, stochastic (online) ADMM visits only one
sample or a mini-batch of samples in each iteration. Recent works have
shown that stochastic ADMM can achieve better performance than batch ADMM
on large-scale datasets in terms of computation complexity and accuracy
[?]. The computation of (4) and (5) is the same for both batch ADMM and
stochastic ADMM, and can typically be completed easily. Hence, different
stochastic ADMM methods mainly focus on proposing different solutions for
(3).

3 Scalable Stochastic ADMM 

In this section, we present the details of our SCAS-ADMM, which is
scalable in terms of both convergence rate and storage cost. Similar to
most existing stochastic ADMM methods, which adapt stochastic gradient
descent (SGD) or its variants [?] to solve the problem in (3), SCAS-ADMM
is also inspired by an existing SGD method called stochastic variance
reduced gradient (SVRG) [15]. But different from SVRG, our SCAS-ADMM can
be used to model more complex problems with equality constraints.

In this paper, we assume that $f(\cdot)$ and all the $\{f_i(\cdot)\}$ are
$\nu_f$-smooth. For $g(\cdot)$, we only assume it to be convex, but not
necessarily smooth or Lipschitz continuous. This is a reasonable
assumption for many machine learning problems, such as the lasso with
logistic loss or square loss. The proofs of the theorems of this paper can
be found in the Appendix in the supplementary materials.

3.1 General Convex Problems 

In general convex problems, $f(\cdot)$ is $\nu_f$-smooth and general
convex, but not necessarily strongly convex.

3.1.1 Algorithm


As in existing stochastic ADMM methods [?], the update rules for $y$ and
$\beta$ are still the same as those in (4) and (5). We only need to design
a new strategy to update $x$. The algorithm for our SCAS-ADMM is briefly
presented in Algorithm 1. It changes (3) to:

    x_{t+1} = \frac{1}{M_t} \sum_{m=0}^{M_t - 1} w_m,                  (8)

where $M_t$ is a parameter denoting the number of iterations in the inner
loop, and

    w_0 = x_t,
    w_{m+1} = \pi_X( w_m - \eta_t [ \nabla f_{i_m}(w_m) - \nabla f_{i_m}(w_0) + z_t
                                    + A^T \beta_t + \rho A^T (A w_m + B y_t - c) ] ),    (9)

with $i_m$ being an index randomly sampled from $\{1, 2, \ldots, n\}$,
$\eta_t$ being the step size,
$z_t = \nabla f(x_t) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_t)$ being
the full gradient at $x_t$, $X$ being the domain of $x$, and
$\pi_X(\cdot)$ denoting the projection operation onto the domain $X$.
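The gradient estimate inside (9) can be checked numerically. The sketch below, which assumes NumPy and uses synthetic least-squares data (all names and sizes are hypothetical), verifies that $\nabla f_{i}(w_m) - \nabla f_{i}(w_0) + z_t$ averages to the full gradient $\nabla f(w_m)$ over $i$, and compares its spread with that of the plain stochastic gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
A_data = rng.normal(size=(n, p))      # rows a_i (hypothetical synthetic data)
b = rng.normal(size=n)                # targets b_i

def grad_i(x, i):
    """Gradient of the least-squares loss f_i(x) = (b_i - a_i^T x)^2."""
    return -2.0 * (b[i] - A_data[i] @ x) * A_data[i]

def full_grad(x):
    """Gradient of f(x) = (1/n) * sum_i f_i(x)."""
    return np.mean([grad_i(x, i) for i in range(n)], axis=0)

w0 = rng.normal(size=p)               # snapshot point w_0 = x_t
w = w0 + 0.1 * rng.normal(size=p)     # a nearby inner iterate w_m
z = full_grad(w0)                     # z_t, computed once per outer iteration

def vr_grad(i):
    """Variance-reduced estimate of grad f(w), as used inside (9)."""
    return grad_i(w, i) - grad_i(w0, i) + z

# Averaging over all indices recovers the exact gradient at w (unbiasedness).
mean_estimate = np.mean([vr_grad(i) for i in range(n)], axis=0)

# Mean squared deviation from the full gradient, for both estimators.
var_plain = np.mean([np.sum((grad_i(w, i) - full_grad(w)) ** 2) for i in range(n)])
var_vr = np.mean([np.sum((vr_grad(i) - full_grad(w)) ** 2) for i in range(n)])
```

Only `z`, `w0` and `w` are kept in memory here; no per-sample gradient table is stored, which is the property the method relies on.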


Algorithm 1 SCAS-ADMM for general convex problems
  Initialize: (x_0, y_0, \beta_0), a convex set X
  for t = 0 to T - 1 do
    Compute z_t = \nabla f(x_t) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_t);
    w_0 = x_t;
    s = w_0;
    for m = 0 to M_t - 2 do
      Randomly select an i_m from {1, 2, ..., n};
      w_{m+1} = \pi_X( w_m - \eta_t [ \nabla f_{i_m}(w_m) - \nabla f_{i_m}(w_0) + z_t
                                      + A^T \beta_t + \rho A^T (A w_m + B y_t - c) ] );
      s = s + w_{m+1};
    end for
    x_{t+1} = s / M_t;
    y_{t+1} = \arg\min_y L(x_{t+1}, y, \beta_t);
    \beta_{t+1} = \beta_t + \rho (A x_{t+1} + B y_{t+1} - c);
  end for
  Output: \bar{x}_T = \frac{1}{T} \sum_{t=1}^{T} x_t,  \bar{y}_T = \frac{1}{T} \sum_{t=1}^{T} y_t
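A minimal sketch of Algorithm 1 on a small lasso instance ($f_i(x) = (b_i - a_i^T x)^2$, $g(y) = \lambda\|y\|_1$, $A = I$, $B = -I$, $c = 0$) is given below, assuming NumPy. Several choices are assumptions made for illustration only: a constant step size `eta` and a fixed inner-loop length $M_t = n$ instead of a schedule in $t$, a Euclidean ball of radius `radius` as the bounded set $X$, and synthetic data. The closed-form $y$-update holds for this particular $g$ and $B$, not in general.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1 (closed-form y-update for an L1 term)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def project_ball(w, radius):
    """Projection pi_X onto X = {w : ||w|| <= radius}, a bounded convex set."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def scas_admm_lasso(a, b, lam, rho=1.0, eta=0.01, T=30, radius=10.0, seed=0):
    """Sketch of Algorithm 1 for min (1/n) sum_i (b_i - a_i^T x)^2 + lam*||y||_1, x = y."""
    rng = np.random.default_rng(seed)
    n, p = a.shape
    M = n                                    # assumption: fixed inner-loop length
    x, y, beta = np.zeros(p), np.zeros(p), np.zeros(p)
    xs, ys = [], []

    def grad_i(w, i):
        return -2.0 * (b[i] - a[i] @ w) * a[i]

    for _ in range(T):
        z = -2.0 * a.T @ (b - a @ x) / n     # full gradient z_t at the snapshot x_t
        w0 = x.copy()
        w = w0.copy()
        s = w0.copy()
        for _ in range(M - 1):
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w0, i) + z      # variance-reduced gradient
            penalty = beta + rho * (w - y)            # A^T beta + rho A^T (A w + B y - c)
            w = project_ball(w - eta * (v + penalty), radius)
            s = s + w
        x = s / M                                     # (8): average of w_0, ..., w_{M-1}
        y = soft_threshold(x + beta / rho, lam / rho) # (4) in closed form
        beta = beta + rho * (x - y)                   # (5)
        xs.append(x)
        ys.append(y)
    return np.mean(xs, axis=0), np.mean(ys, axis=0)

def objective(a, b, lam, x):
    return np.mean((b - a @ x) ** 2) + lam * np.sum(np.abs(x))

# Hypothetical synthetic problem with a sparse ground truth.
data_rng = np.random.default_rng(1)
a = data_rng.normal(size=(200, 5))
x_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
b = a @ x_true + 0.1 * data_rng.normal(size=200)
x_bar, y_bar = scas_admm_lasso(a, b, lam=0.1)
```

Apart from the data, the state carried between iterations is a handful of length-$p$ vectors (`z`, `w0`, `w`, `s`, `y`, `beta`) plus the iterates kept for the final average.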


Compared with SVRG [15], the update rule in (9) has an extra vector
$A^T \beta_t + \rho A^T (A w_m + B y_t - c) = \rho A^T A w_m + A^T (\beta_t + \rho (B y_t - c))$.
If the matrix $A = 0$, which means $By = c$, then $x$ and $y$ are
independent, and Algorithm 1 degenerates to SVRG since we only need to
solve the minimization problems about $f(x)$ and $g(y)$ separately. Hence
SCAS-ADMM is more general than SVRG, since it can solve the minimization
problem with more complex equality constraints.

Besides the memory to store $A$ and $B$, the memory to store $z_t$, $w_0$,
$s$, and $w_{m+1}$ is only $O(p)$, where $p$ is the number of parameters,
i.e., the length of the vector $x$. Furthermore, it only needs some other
memory to store $\{x_t \mid t = 0, 1, \ldots, T\}$ and
$\{y_t \mid t = 0, 1, \ldots, T\}$. This memory cost is typically small
because $T$ is not too large in practice. For example, $T = 15$ is enough
for SCAS-ADMM to achieve satisfactory accuracy in our experiments, which
will be presented in Section 4. Furthermore, SCAS-ADMM does not need to
store the historic gradients for all samples which are used in SA-ADMM.
Hence, SCAS-ADMM is scalable in terms of storage cost.

3.1.2 Convergence Analysis 

We say that a set $X$ is bounded by $D$ if it satisfies
$\sup_{x, x' \in X} \|x - x'\| \le D$, where $D$ is a constant.

Assume we have got $(x_t, y_t, \beta_t)$, and we define:

    \mathcal{L}(x) = L(x, y_t, \beta_t).                               (10)

We can get the following convergence theorem.

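Theorem 1. Assume the optimal solution of (1) is $(x_*, y_*, \beta_*)$,
$X$ is bounded by $D$ and contains $x_*$, $f(x)$ and all the functions
$\{f_i(x)\}$ are general convex and $\nu_f$-smooth, and the function
$g(y)$ is convex. We have the following convergence result for
Algorithm 1:

    E[ f(\bar{x}_T) + g(\bar{y}_T) - f(x_*) - g(y_*) + \gamma \|A \bar{x}_T + B \bar{y}_T - c\| ]
      \le \frac{1}{T} \sum_{t=0}^{T-1} [ \frac{D^2}{2 M_t \eta_t} + \eta_t (\nu_{\mathcal{L}}^2 D^2 + G_t^2) ]
          + \frac{\rho}{2T} \|y_0 - y_*\|_H^2 + \frac{1}{\rho T} (\|\beta_0\|^2 + \gamma^2),    (11)

where $H = B^T B$, $\|x\|_H^2 = x^T H x$, $\gamma > 0$ is a constant,
$\nu_{\mathcal{L}}$ is the Lipschitz constant of $\mathcal{L}(x)$, and
$G_t = \|\nabla \mathcal{L}(x_t)\|$.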

Let $e_t = \frac{D^2}{2 M_t \eta_t} + \eta_t (\nu_{\mathcal{L}}^2 D^2 + G_t^2)$.
To make $f(\bar{x}_T) + g(\bar{y}_T)$ converge to $f(x_*) + g(y_*)$, we
need to make sure that $\sum_{t=0}^{T-1} e_t$ is bounded or not too large.

By taking gt = + Gl){t + we have: 

• If $\delta > 1$, then $\sum_{t=0}^{T-1} e_t$ is a constant, which means
that $f(\bar{x}_T) + g(\bar{y}_T)$ converges to $f(x_*) + g(y_*)$ with a
convergence rate of $O(1/T)$.

• If $\delta = 1$, then $\sum_{t=0}^{T-1} e_t = O(\log T)$, which means
that $f(\bar{x}_T) + g(\bar{y}_T)$ converges to $f(x_*) + g(y_*)$ with a
convergence rate of $O(\log T / T)$.

Hence, by choosing $\delta > 1$, we get a convergence rate of $O(1/T)$
for our SCAS-ADMM on general convex problems, which is the same as the
best convergence rate achieved by the existing stochastic ADMM method
(SA-ADMM).
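The two cases can be illustrated with a short computation. The sketch below assumes, purely for illustration, that the per-iteration term behaves like $e_t = (t+1)^{-\delta}$ (constants dropped), and compares the partial sums for $\delta = 2$ and $\delta = 1$; the function name and the value of `T` are hypothetical.

```python
import math

def partial_sum(delta, T):
    """Sum of (t + 1)^(-delta) for t = 0, ..., T - 1."""
    return sum((t + 1) ** (-delta) for t in range(T))

T = 100_000
sum_bounded = partial_sum(2.0, T)   # stays below pi^2 / 6 for every T
sum_log = partial_sum(1.0, T)       # harmonic number, grows like log T

# Dividing by T gives the resulting bound on the averaged iterate.
rate_bounded = sum_bounded / T      # O(1/T)
rate_log = sum_log / T              # O(log T / T)
```

With $\delta > 1$ the sum is bounded independently of $T$, so the $1/T$ factor in (11) determines the rate; with $\delta = 1$ an extra $\log T$ factor remains.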

3.2 Strongly Convex Problems

In Algorithm 1, as $t$ increases, the number of inner-loop iterations
$M_t$ needs to be increased and the step size $\eta_t$ needs to be
decreased. This might cause a large computation cost when $T$ gets large.
We can get a better algorithm when $f(x)$ in (1) is strongly convex.

3.2.1 Algorithm

When $f(x)$ is strongly convex, our SCAS-ADMM is briefly presented in
Algorithm 2. Algorithm 2 is similar to Algorithm 1 but uses constant
values for $M_t$ and $\eta_t$.


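Algorithm 2 SCAS-ADMM for strongly convex problems
  Initialize: (x_0, y_0, \beta_0), r = 2\eta - \frac{\eta}{1 - \nu_{\mathcal{L}} \eta / 2},
              s = \frac{\eta}{1 - \nu_{\mathcal{L}} \eta / 2}, a convex set X;
  for t = 0 to T - 1 do
    Compute z_t = \nabla f(x_t) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_t);
    w_0 = x_t;
    \mathbf{s} = 0;
    for m = 0 to M - 1 do
      Randomly select an i_m from {1, 2, ..., n};
      w_{m+1} = \pi_X( w_m - \eta [ \nabla f_{i_m}(w_m) - \nabla f_{i_m}(w_0) + z_t
                                    + A^T \beta_t + \rho A^T (A w_m + B y_t - c) ] );
      \bar{w}_{m+1} = \frac{1}{2\eta} (r w_m + s w_{m+1});
      \mathbf{s} = \mathbf{s} + \bar{w}_{m+1};
    end for
    x_{t+1} = \mathbf{s} / M;
    y_{t+1} = \arg\min_y L(x_{t+1}, y, \beta_t);
    \beta_{t+1} = \beta_t + \rho (A x_{t+1} + B y_{t+1} - c);
  end for
  Output: \bar{x}_T = \frac{1}{T} \sum_{t=1}^{T} x_t,  \bar{y}_T = \frac{1}{T} \sum_{t=1}^{T} y_t

Here the scalars $r$ and $s$ are the weights of the convex combination
$\bar{w}_{m+1}$, while the vector $\mathbf{s}$ accumulates the
$\bar{w}_{m+1}$.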


3.2.2 Convergence Analysis 

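Theorem 2. Assume the optimal solution of (1) is $(x_*, y_*, \beta_*)$,
all the functions $\{f_i(x)\}$ are general convex and $\nu_f$-smooth,
$f(x)$ is strongly convex and $\nu_f$-smooth, and $g(y)$ is convex. We
have the following result:

    E[ f(\bar{x}_T) + g(\bar{y}_T) - f(x_*) - g(y_*) + \gamma \|A \bar{x}_T + B \bar{y}_T - c\| ]
      \le \frac{\mu_f}{4T} \|x_0 - x_*\|^2 + \frac{\rho}{2T} \|y_0 - y_*\|_H^2
          + \frac{1}{\rho T} (\|\beta_0\|^2 + \gamma^2),               (12)

where $H = B^T B$, and $\gamma > 0$ is a constant.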

In this case, we can set $M$ and $\eta$ to be constants. Please note that in



the maximum eigenvalue of A^A, and a = I — — 1^. Different from 
Algorithm 1, we do not need the convex set $X$ in Algorithm 2 to be
bounded, and for unconstrained problems we do not even need such a set.

3.3 Comparison to Related Methods 

We compare our SCAS-ADMM to other stochastic ADMM methods in terms of
three key factors: penalty term linearization, convergence rate on general
convex problems, and memory cost. The matrix inversion can be avoided by
linearizing the penalty term $\frac{\rho}{2} \|Ax + By - c\|^2$ [3].
Hence, penalty term linearization can be used to decrease computation
cost. The comparison results are summarized in Table 1, where SA-IU-ADMM
is a variant of SA-ADMM with penalty term linearization. Please note that
$A \in \mathbb{R}^{l \times p}$, $B \in \mathbb{R}^{l \times q}$,
$x \in \mathbb{R}^{p}$, $y \in \mathbb{R}^{q}$, $c \in \mathbb{R}^{l}$,
$p$ is the number of parameters to learn, and $n$ is the number of
training samples.

It is easy to see that only SCAS-ADMM achieves the best performance in
terms of both convergence rate and memory cost. Other methods either
achieve only a sub-optimal convergence rate, or need more memory than
SCAS-ADMM. In particular, SA-ADMM and SA-IU-ADMM need extra memory as
large as $O(np)$ to store the historic gradients for all samples.
Typically, $n$ is very large in big data applications. Furthermore,
SCAS-ADMM can also avoid the matrix inversion by linearizing the penalty
term. Hence, SCAS-ADMM is indeed scalable in terms of both computation
cost and memory cost.

Table 1: Comparison to related methods

Method            Penalty term linearization?   Convergence rate   Memory cost
OADM [5]          NO                            O(1/\sqrt{T})      O(lp + lq)
STOC-ADMM [?]     NO                            O(1/\sqrt{T})      O(lp + lq)
OPG-ADMM [3]      YES                           O(1/\sqrt{T})      O(lp + lq)
RDA-ADMM [3]      YES                           O(1/\sqrt{T})      O(lp + lq)
OS-ADMM [7]       YES                           O(1/\sqrt{T})      O(lp + lq)
SA-ADMM [4]       NO                            O(1/T)             O(np + lp + lq)
SA-IU-ADMM [?]    YES                           O(1/T)             O(np + lp + lq)
SCAS-ADMM         YES                           O(1/T)             O(lp + lq)
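To give a sense of scale for the $O(np)$ term, the sketch below computes the size of a dense table of per-sample gradients and of four length-$p$ working vectors, using the sample and feature counts listed in Table 2 as $n$ and $p$. Dense 64-bit storage and the use of the full sample count as $n$ are assumptions made for illustration; an actual implementation may store gradients more compactly.

```python
BYTES_PER_VALUE = 8  # assumption: dense 64-bit floats

# (n, p) = (#samples, #features), taken from Table 2
DATASETS = {
    "a9a": (32561, 123),
    "covertype": (581012, 54),
    "rcv1": (20242, 47236),
    "sido": (12678, 4932),
}

def gibibytes(num_values):
    """Convert a count of stored values to GiB."""
    return num_values * BYTES_PER_VALUE / 2 ** 30

def extra_memory(n, p):
    """Return (n*p gradient table, four working vectors of length p), both in GiB."""
    return gibibytes(n * p), gibibytes(4 * p)

report = {name: extra_memory(n, p) for name, (n, p) in DATASETS.items()}
```

Under these assumptions the gradient table for rcv1 alone takes roughly 7 GiB, while the working vectors take about a megabyte and a half.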


4 Experiments 

As in [?], we evaluate our method on the generalized lasso model [16],
which can be formulated as follows:

    \min_x \frac{1}{n} \sum_{i=1}^{n} f_i(x) + \lambda \|Ax\|_1,       (13)

where $f_i(x)$ is the logistic loss, $A$ is a matrix specifying the
desired structured sparsity pattern for $x$, and $\lambda$ is the
regularization hyper-parameter. We can get different models like fused
lasso and wavelet smoothing by specifying different $A$. In this paper, we
focus on the graph-guided fused lasso [5], which is also used in [?]. As
in [?], we use the sparse inverse covariance selection method [?] to get a
graph matrix (sparsity pattern) $G$, based on which we can get
$A = [G; I]$. In general, both $G$ and $A$ are sparse.

We can formulate (13) with the ADMM framework:

    \min_{x,y} P(x, y) = \frac{1}{n} \sum_{i=1}^{n} f_i(x) + g(y),     (14)
    s.t.  Ax - y = 0,

where $g(y) = \lambda \|y\|_1$.
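For the formulation (14), where $B = -I$, $c = 0$ and $g(y) = \lambda\|y\|_1$, the $y$-update (4) has the closed form $y_{t+1} = \mathrm{soft}(A x_{t+1} + \beta_t / \rho,\ \lambda / \rho)$, obtained by minimizing $\lambda\|y\|_1 - \beta_t^T y + \frac{\rho}{2}\|A x_{t+1} - y\|^2$ coordinate-wise. The sketch below, assuming NumPy, applies it to a tiny hand-made chain graph; the matrix `G` and all numeric values are hypothetical and only illustrate the shape $A = [G; I]$.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft-thresholding, the proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def y_update(A, x, beta, lam, rho):
    """Closed-form minimizer of lam*||y||_1 - beta^T y + (rho/2)*||A x - y||^2."""
    return soft_threshold(A @ x + beta / rho, lam / rho)

# Hypothetical 3-node chain graph: each row of G takes a difference along an edge.
G = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])
A = np.vstack([G, np.eye(3)])          # A = [G; I]

x = np.array([1.0, 1.05, -2.0])
beta = np.zeros(A.shape[0])
y = y_update(A, x, beta, lam=0.1, rho=1.0)
```

The first entry of `A @ x` is the small difference between two neighbouring coefficients and is thresholded to zero, which is how the penalty encourages connected coefficients to take equal values.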

4.1 Baselines and Datasets 

Three representative ADMM methods are adopted as baselines for comparison.
They are:

• Batch-ADMM [1]: The deterministic (batch) variant of ADMM, which uses
(7) to directly update $x$ by visiting all training samples in each
iteration.

• STOC-ADMM [?]: The stochastic ADMM variant that does not use historic
gradients for optimization, which has a convergence rate of
$O(1/\sqrt{T})$ for general convex problems and $O(\log T / T)$ for
strongly convex problems.

• SA-ADMM [4]: The stochastic ADMM variant that uses historic gradients
to approximate the full gradient, which has a convergence rate of
$O(1/T)$ for general convex problems.

Please note that other methods, such as OPG-ADMM, RDA-ADMM and OS-ADMM,
are not adopted for comparison because they have convergence rates similar
to that of STOC-ADMM. Furthermore, both theoretical and empirical results
have shown that SA-ADMM can outperform other methods like RDA-ADMM and
OPG-ADMM [3]. The variant of SA-ADMM, SA-IU-ADMM, is also not adopted for
comparison because its performance is similar to that of SA-ADMM [1].

Although $M_t$ in Algorithm 1 should be increased as $t$ increases, we
simply set $M_t = n$ in our experiments because SCAS-ADMM can also achieve
good performance with this fixed value for $M_t$. Similarly, we set
$M = n$ in Algorithm 2.

As in [3], four widely used datasets are adopted to evaluate our method
and the other baselines. They are a9a, covertype, rcv1 and sido. All of
them are for binary classification. Each dataset is randomly partitioned
into a part for training and a part for testing. This random partition is
repeated 10 times and the average values are reported. The
hyper-parameter $\lambda$ in (14) is set by using the same values as in
[3], which are also listed in Table 2. We adopt the same strategy as that
in [3] to set the hyper-parameter $\rho$ in (2) and the stepsize. More
specifically, we randomly choose a small subset of 500 samples from the
training set, and then choose the hyper-parameters which achieve the
smallest objective value after running 5 data passes for stochastic
methods or 100 data passes (iterations) for batch methods. As in [4], we
use $y(\bar{x}_T) = A \bar{x}_T$ to replace $\bar{y}_T$, since the methods
cannot necessarily guarantee that $A \bar{x}_T = \bar{y}_T$.

Table 2: Information about the datasets

Dataset      #Samples   #Features   \lambda
a9a          32561      123         10^{?}
covertype    581012     54          10^{?}
rcv1         20242      47236       10^{?}
sido         12678      4932        10^{?}

All the experiments are conducted on a workstation with 12 Intel Xeon CPU
cores and 64G RAM.

4.2 Convergence Results 

As in [?], we study the variation of the objective value on the training
set and the testing loss versus the number of effective passes over the
data. For all methods, one effective pass over the data means that $n$
samples are visited. More specifically, one effective pass refers to one
iteration in batch ADMM. For stochastic ADMM methods which visit one
sample in each iteration, one effective pass refers to $n$ iterations. For
SCAS-ADMM, we set $M_t = n$ and each iteration of the outer loop needs to
visit $2n$ training samples. Hence, each iteration of the outer loop
contributes two effective passes. Although different methods visit
different numbers of samples in each iteration, the number of effective
passes over the data is a good metric for fair comparison because it
measures the computation costs of different methods in a unified way.
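The accounting above can be written out as a small helper. This is an illustrative sketch only; the function names are hypothetical, and the value of `n` is the a9a sample count from Table 2 used as an example.

```python
def passes_batch(iterations):
    """Batch ADMM: every iteration visits all n samples, i.e. one pass."""
    return float(iterations)

def passes_stochastic(iterations, n):
    """One-sample-per-iteration methods: n iterations make one pass."""
    return iterations / n

def passes_scas(outer_iterations, n, inner_iterations):
    """SCAS-ADMM: n samples for the full gradient plus one sample per inner step."""
    return outer_iterations * (n + inner_iterations) / n

n = 32561                                  # example value
scas_passes = passes_scas(15, n, n)        # M_t = n, T = 15 outer iterations
sgd_passes = passes_stochastic(5 * n, n)   # 5n single-sample iterations
```

With $M_t = n$, fifteen outer iterations of SCAS-ADMM therefore correspond to thirty effective passes.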

Figure 1 shows the results for general convex problems with $f_i(x)$
being the logistic loss. Please note that the number of recorded points on
the curve of SCAS-ADMM is half of that for the other methods, because each
iteration of the outer loop of SCAS-ADMM contributes two effective passes.
As stated above, it is still fair to compare different methods with
respect to the number of effective passes. In Figure 1, all the points
with the same x-axis value from different curves have the same number of
effective passes. Hence, for two points with the same x-axis value from
any two different curves, the point with the smaller y-axis value is
better than the other one. We can find that all the stochastic methods
outperform Batch-ADMM in terms of both training speed and testing
accuracy. SCAS-ADMM and SA-ADMM outperform STOC-ADMM, which is consistent
with the theoretical analysis of the convergence rates. Our SCAS-ADMM
achieves performance comparable to that of SA-ADMM, which empirically
supports our theoretical result that SCAS-ADMM has the same convergence
rate of $O(1/T)$ as SA-ADMM.

By adding a small $L_2$ regularization term to the logistic loss, we get
strongly convex problems. Figure 2 shows the results for strongly convex
problems. Once again, we observe a phenomenon similar to that in Figure 1.
In particular, our SCAS-ADMM achieves a convergence rate comparable to
that of SA-ADMM.

As for the memory (storage) cost, it is clear from the theoretical
analysis in Table 1 that SCAS-ADMM needs much less memory than SA-ADMM.
Hence, we do not compare them empirically.

5 Conclusion 

In this paper, we have proposed a new stochastic ADMM method called
SCAS-ADMM, which can achieve the same convergence rate as the best
existing stochastic ADMM method SA-ADMM on general convex problems.
Furthermore, it costs much less memory than SA-ADMM. Hence, SCAS-ADMM is
scalable in terms of both convergence rate and storage cost.

A Notations for Proof

We let

    v_{m,t} = \nabla f_{i_m}(w_m) - \nabla f_{i_m}(w_0) + z_t,         (15)
    b_{m,t} = \beta_t + \rho (A w_m + B y_t - c),                      (16)
    p_{m,t} = v_{m,t} + A^T b_{m,t}.                                   (17)

Then the update rule for $w_{m+1}$ in the inner loop of Algorithm 1 can be
rewritten as

    w_{m+1} = \pi_X( w_m - \eta_t (v_{m,t} + A^T b_{m,t}) )
            = \pi_X( w_m - \eta_t p_{m,t} ).                           (18)

Assume we have got $(x_t, y_t, \beta_t)$, and we define:

    \mathcal{L}(x) = L(x, y_t, \beta_t),                               (19)
    \mathcal{L}_i(x) = f_i(x) + g(y_t) + \beta_t^T (Ax + B y_t - c) + \frac{\rho}{2} \|Ax + B y_t - c\|^2.    (20)


B Lemmas for the Proof of Theorem 1 


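Lemma 1. If $f(x)$ is $\nu_f$-smooth, then $\exists \nu_{\mathcal{L}} > 0$
that makes $\mathcal{L}(x)$ $\nu_{\mathcal{L}}$-smooth.

Proof. According to the definition of $\nu_f$-smooth, $\forall a, b$, we
have

    \|\nabla \mathcal{L}(b) - \nabla \mathcal{L}(a)\|
      = \|\nabla f(b) - \nabla f(a) + \rho A^T A (b - a)\|
      \le \|\nabla f(b) - \nabla f(a)\| + \|\rho A^T A (b - a)\|
      \le \nu_f \|b - a\| + \|\rho A^T A (b - a)\|
      \le \nu_f \|b - a\| + \rho \sqrt{\lambda_A} \|b - a\|
      = (\nu_f + \rho \sqrt{\lambda_A}) \|b - a\|,

where $\lambda_A > 0$ is the largest eigenvalue of $A^T A A^T A$.

Hence, for any value of $\nu_{\mathcal{L}} \ge (\nu_f + \rho \sqrt{\lambda_A})$,
we can see that $\mathcal{L}(x)$ is $\nu_{\mathcal{L}}$-smooth. □

We can find that $\nu_{\mathcal{L}}$ is only determined by $f(x)$, the
matrix $A$ and the penalty parameter $\rho$, but has nothing to do with
$g(y_t)$, $y_t$, $B$ and $\beta_t$.

Then, we have the following lemma about the variance of $p_{m,t}$.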

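Lemma 2. The variance of $p_{m,t}$ satisfies:

    E(\|p_{m,t}\|^2) \le 2 \nu_{\mathcal{L}}^2 D^2 + 2 G_t^2,          (21)

where $D$ is the bound of the domain of $x$, $\nu_{\mathcal{L}}$ is the
Lipschitz constant of the function $\mathcal{L}(x)$ defined in (19), and
$G_t = \|\nabla \mathcal{L}(x_t)\|$.

Proof. According to (17), we have

    p_{m,t} = v_{m,t} + A^T b_{m,t}
            = \nabla f_{i_m}(w_m) + A^T b_{m,t} - \nabla f_{i_m}(w_0) - A^T b_{0,t} + z_t + A^T b_{0,t}
            = \nabla \mathcal{L}_{i_m}(w_m) - \nabla \mathcal{L}_{i_m}(w_0) + \nabla \mathcal{L}(w_0).

Then we have:

    E(\|p_{m,t}\|^2)
      = \frac{1}{n} \sum_{i=1}^{n} \|\nabla \mathcal{L}_i(w_m) - \nabla \mathcal{L}_i(w_0) + \nabla \mathcal{L}(w_0)\|^2
      \le \frac{2}{n} \sum_{i=1}^{n} \{ \|\nabla \mathcal{L}_i(w_m) - \nabla \mathcal{L}_i(w_0)\|^2 + \|\nabla \mathcal{L}(w_0)\|^2 \}
      \le \{ \frac{2}{n} \sum_{i=1}^{n} \|\nabla \mathcal{L}_i(w_m) - \nabla \mathcal{L}_i(w_0)\|^2 \} + 2 \|\nabla \mathcal{L}(w_0)\|^2
      \le 2 \nu_{\mathcal{L}}^2 D^2 + 2 G_t^2.

Please note that $w_0 = x_t$, and we use the Lipschitz definition to get
the result:
$\|\nabla \mathcal{L}_i(w_m) - \nabla \mathcal{L}_i(w_0)\|^2 \le \nu_{\mathcal{L}}^2 \|w_m - w_0\|^2 \le \nu_{\mathcal{L}}^2 D^2$.
□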

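Lemma 3. For the estimation of $x_{t+1}$, we have the following result:

    E[ f(x_{t+1}) - f(x) + (A^T \alpha_{t+1})^T (x_{t+1} - x) ]
      \le \frac{D^2}{2 M_t \eta_t} + \eta_t (\nu_{\mathcal{L}}^2 D^2 + G_t^2),    (22)

where $\alpha_{t+1} = \beta_t + \rho (A x_{t+1} + B y_t - c)$.

Proof. Since $X$ is convex, we have: $\forall x \in X$,

    \|w_{m+1} - x\|^2 \le \|w_m - \eta_t p_{m,t} - x\|^2               (23)
      = \|w_m - x\|^2 - 2 \eta_t p_{m,t}^T (w_m - x) + \eta_t^2 \|p_{m,t}\|^2.

Furthermore, it is easy to prove that $E[v_{m,t}] = \nabla f(w_m)$. Based
on the results in Lemma 2, we can take the expectation of (23):

    E \|w_{m+1} - x\|^2                                                (24)
      \le E \|w_m - x\|^2 - 2 \eta_t E[ (\nabla f(w_m) + A^T b_{m,t})^T (w_m - x) ] + \eta_t^2 E(\|p_{m,t}\|^2)
      \le E \|w_m - x\|^2 + \eta_t^2 (2 \nu_{\mathcal{L}}^2 D^2 + 2 G_t^2)
          - 2 \eta_t E[ f(w_m) - f(x) + (A^T b_{m,t})^T (w_m - x) ].

Summing up (24) from $m = 0$ to $M_t - 1$, we can get:

    2 \eta_t \sum_{m=0}^{M_t - 1} E[ f(w_m) - f(x) + (A^T b_{m,t})^T (w_m - x) ] + E \|w_{M_t} - x\|^2
      \le \|w_0 - x\|^2 + M_t \eta_t^2 (2 \nu_{\mathcal{L}}^2 D^2 + 2 G_t^2)
      \le \|x_t - x\|^2 + M_t \eta_t^2 (2 \nu_{\mathcal{L}}^2 D^2 + 2 G_t^2)
      \le D^2 + M_t \eta_t^2 (2 \nu_{\mathcal{L}}^2 D^2 + 2 G_t^2).

We can prove that $f(w_m) - f(x) + (A^T b_{m,t})^T (w_m - x)$ is convex in
$w_m$. Furthermore, we have
$x_{t+1} = \frac{1}{M_t} \sum_{m=0}^{M_t - 1} w_m$. By using Jensen's
inequality, we have:

    2 \eta_t M_t E[ f(x_{t+1}) - f(x) + (A^T \alpha_{t+1})^T (x_{t+1} - x) ]
      \le 2 \eta_t \sum_{m=0}^{M_t - 1} E[ f(w_m) - f(x) + (A^T b_{m,t})^T (w_m - x) ]
      \le D^2 + M_t \eta_t^2 (2 \nu_{\mathcal{L}}^2 D^2 + 2 G_t^2),

where $\alpha_{t+1} = \beta_t + \rho (A x_{t+1} + B y_t - c)$.
Then, we can get:

    E[ f(x_{t+1}) - f(x) + (A^T \alpha_{t+1})^T (x_{t+1} - x) ]
      \le \frac{D^2}{2 M_t \eta_t} + \eta_t (\nu_{\mathcal{L}}^2 D^2 + G_t^2).    (25)

□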


According to the results in [3], we have the following Lemma 4 and
Lemma 5 about the estimation of $y_{t+1}$ and $\alpha_{t+1}$.

Lemma 4. For the estimation of $y_{t+1}$, we have:

    E[ g(y_{t+1}) - g(y) + (B^T \alpha_{t+1})^T (y_{t+1} - y) ]
      \le \frac{\rho}{2} E[ \|y_t - y\|_H^2 - \|y_{t+1} - y\|_H^2 - \|y_t - y_{t+1}\|_H^2 ],    (26)

where $\alpha_{t+1} = \beta_t + \rho (A x_{t+1} + B y_t - c)$,
$H = B^T B$, and $\|y\|_H^2 = y^T H y$.

Lemma 5. For the estimation of $\alpha_{t+1}$, we have:

    E[ -(A x_{t+1} + B y_{t+1} - c)^T (\alpha_{t+1} - \alpha) ]
      \le \frac{1}{2\rho} E[ \|\beta_t - \alpha\|^2 - \|\beta_{t+1} - \alpha\|^2 ]
          + \frac{\rho}{2} E \|y_t - y_{t+1}\|_H^2,                    (27)

where $\alpha_{t+1} = \beta_t + \rho (A x_{t+1} + B y_t - c)$,
$H = B^T B$, and $\|y\|_H^2 = y^T H y$.

The proofs of Lemma 4 and Lemma 5 can be directly derived from the results
in [4], and are omitted here to save space.
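For readers who want the missing step, the following is a sketch of how (26) can be obtained from the optimality condition of the $y$-update (4). It is a reconstruction under the assumption that a subgradient $\xi \in \partial g(y_{t+1})$ exists at the minimizer, and is not the derivation given in the cited work.

```latex
% Optimality of y_{t+1} in (4), with \xi \in \partial g(y_{t+1}):
0 = \xi + B^T \beta_t + \rho B^T (A x_{t+1} + B y_{t+1} - c)
  = \xi + B^T \alpha_{t+1} + \rho H (y_{t+1} - y_t).

% Convexity of g, then substituting \xi from the line above:
g(y_{t+1}) - g(y) \le \xi^T (y_{t+1} - y)
  = -(B^T \alpha_{t+1})^T (y_{t+1} - y) - \rho\, (y_{t+1} - y_t)^T H (y_{t+1} - y).

% Three-point identity in the H-seminorm:
-2\, (y_{t+1} - y_t)^T H (y_{t+1} - y)
  = \|y_t - y\|_H^2 - \|y_{t+1} - y\|_H^2 - \|y_t - y_{t+1}\|_H^2.

% Combining the two displays gives (26):
g(y_{t+1}) - g(y) + (B^T \alpha_{t+1})^T (y_{t+1} - y)
  \le \frac{\rho}{2} \left( \|y_t - y\|_H^2 - \|y_{t+1} - y\|_H^2 - \|y_t - y_{t+1}\|_H^2 \right).
```

Taking expectations on both sides yields the stated form.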
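
C Proof of Theorem 1

Proof. Let $u = [x; y; \alpha]$, $u_t = [x_t; y_t; \alpha_t]$,
$\bar{u}_T = \frac{1}{T} \sum_{t=1}^{T} u_t$, and
$F(u) = [A^T \alpha;\ B^T \alpha;\ -(Ax + By - c)]$.

Summing up the inequalities in (22), (26) and (27), we have:

    E[ P(x_{t+1}, y_{t+1}) - P(x, y) + F(u_{t+1})^T (u_{t+1} - u) ]
      \le \frac{D^2}{2 M_t \eta_t} + \eta_t (\nu_{\mathcal{L}}^2 D^2 + G_t^2)
          + \frac{\rho}{2} E[ \|y_t - y\|_H^2 - \|y_{t+1} - y\|_H^2 - \|y_t - y_{t+1}\|_H^2 ]
          + \frac{1}{2\rho} E[ \|\beta_t - \alpha\|^2 - \|\beta_{t+1} - \alpha\|^2 ]
          + \frac{\rho}{2} E \|y_t - y_{t+1}\|_H^2.                    (28)

It is easy to prove that $P(x_t, y_t) - P(x, y) + F(u_t)^T (u_t - u)$ is
convex in $(x_t, y_t)$. Moreover, we have
$\bar{x}_T = \frac{1}{T} \sum_{t=1}^{T} x_t$ and
$\bar{y}_T = \frac{1}{T} \sum_{t=1}^{T} y_t$. By using Jensen's
inequality, we have:

    P(\bar{x}_T, \bar{y}_T) - P(x, y) + F(\bar{u}_T)^T (\bar{u}_T - u)
      \le \frac{1}{T} \sum_{t=1}^{T} [ P(x_t, y_t) - P(x, y) + F(u_t)^T (u_t - u) ].    (29)

Summing up (28) from $t = 0$ to $T - 1$, and using the result in (29), we
have:

    E[ P(\bar{x}_T, \bar{y}_T) - P(x, y) + F(\bar{u}_T)^T (\bar{u}_T - u) ]
      \le \frac{1}{T} \sum_{t=0}^{T-1} E[ P(x_{t+1}, y_{t+1}) - P(x, y) + F(u_{t+1})^T (u_{t+1} - u) ]
      \le \frac{1}{T} \sum_{t=0}^{T-1} [ \frac{D^2}{2 M_t \eta_t} + \eta_t (\nu_{\mathcal{L}}^2 D^2 + G_t^2) ]
          + \frac{\rho}{2T} \|y_0 - y\|_H^2 + \frac{1}{2 \rho T} \|\beta_0 - \alpha\|^2.    (30)

The result in (30) is satisfied for any $(x, y, \alpha)$. In particular,
if we take $x = x_*$, $y = y_*$ and
$\alpha = \gamma \frac{A \bar{x}_T + B \bar{y}_T - c}{\|A \bar{x}_T + B \bar{y}_T - c\|}$,
we have:

    E[ P(\bar{x}_T, \bar{y}_T) - P(x_*, y_*) + \gamma \|A \bar{x}_T + B \bar{y}_T - c\| ]
      \le \frac{1}{T} \sum_{t=0}^{T-1} [ \frac{D^2}{2 M_t \eta_t} + \eta_t (\nu_{\mathcal{L}}^2 D^2 + G_t^2) ]
          + \frac{\rho}{2T} \|y_0 - y_*\|_H^2 + \frac{1}{\rho T} (\|\beta_0\|^2 + \gamma^2).    (31)

□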
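The last substitution in the proof of Theorem 1 uses two small facts that are worth writing out. The following is a sketch of those steps, reconstructed from the definitions of $F(u)$ and $\alpha$ in the proof rather than quoted from the source.

```latex
% With x = x_*, y = y_* (so A x_* + B y_* = c), the terms involving \bar{\alpha}_T cancel:
F(\bar{u}_T)^T (\bar{u}_T - u)
  = \bar{\alpha}_T^T (A\bar{x}_T + B\bar{y}_T - A x_* - B y_*)
    - (A\bar{x}_T + B\bar{y}_T - c)^T (\bar{\alpha}_T - \alpha)
  = \alpha^T (A\bar{x}_T + B\bar{y}_T - c).

% With \alpha = \gamma (A\bar{x}_T + B\bar{y}_T - c) / \|A\bar{x}_T + B\bar{y}_T - c\|:
\alpha^T (A\bar{x}_T + B\bar{y}_T - c) = \gamma \|A\bar{x}_T + B\bar{y}_T - c\|,
\qquad \|\alpha\| = \gamma.

% Bounding the multiplier term via \|a - b\|^2 \le 2\|a\|^2 + 2\|b\|^2:
\frac{1}{2\rho T} \|\beta_0 - \alpha\|^2
  \le \frac{1}{2\rho T} \left( 2\|\beta_0\|^2 + 2\gamma^2 \right)
  = \frac{1}{\rho T} \left( \|\beta_0\|^2 + \gamma^2 \right).
```

This accounts for the change of coefficient from $\frac{1}{2\rho T}$ to $\frac{1}{\rho T}$ between the general bound and its specialized form.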





















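D Lemmas for the Proof of Theorem 2

Lemma 6. The variance of $p_{m,t}$ satisfies: $\forall x$,

    E(\|p_{m,t}\|^2) \le d_m + \|\nabla \mathcal{L}(w_m)\|^2
      \le 2 \nu_{\mathcal{L}}^2 \|w_m - x\|^2 + 2 \nu_{\mathcal{L}}^2 \|w_0 - x\|^2 + \|\nabla \mathcal{L}(w_m)\|^2,    (32)

where $d_m = \frac{1}{n} \sum_{i=1}^{n} \|\nabla \mathcal{L}_i(w_m) - \nabla \mathcal{L}_i(w_0)\|^2$.

Proof. $\forall x$,

    E(\|p_{m,t}\|^2)
      = \frac{1}{n} \sum_{i=1}^{n} \|\nabla \mathcal{L}_i(w_m) - \nabla \mathcal{L}_i(w_0) + \nabla \mathcal{L}(w_0)\|^2
      = \|\nabla \mathcal{L}(w_m)\|^2 - \|\nabla \mathcal{L}(w_m) - \nabla \mathcal{L}(w_0)\|^2
        + \frac{1}{n} \sum_{i=1}^{n} \|\nabla \mathcal{L}_i(w_m) - \nabla \mathcal{L}_i(w_0)\|^2
      \le \|\nabla \mathcal{L}(w_m)\|^2 + \frac{1}{n} \sum_{i=1}^{n} \|\nabla \mathcal{L}_i(w_m) - \nabla \mathcal{L}_i(w_0)\|^2
      \le \nu_{\mathcal{L}}^2 \|w_m - w_0\|^2 + \|\nabla \mathcal{L}(w_m)\|^2
      \le 2 \nu_{\mathcal{L}}^2 \|w_m - x\|^2 + 2 \nu_{\mathcal{L}}^2 \|w_0 - x\|^2 + \|\nabla \mathcal{L}(w_m)\|^2.

□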

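Lemma 7. If $\eta - \frac{\nu_{\mathcal{L}} \eta^2}{2} > 0$, we have the
following result for $\|\nabla \mathcal{L}(w_m)\|^2$:

    \|\nabla \mathcal{L}(w_m)\|^2
      \le \frac{2}{2\eta - \nu_{\mathcal{L}} \eta^2} ( \mathcal{L}(w_m) - E[\mathcal{L}(w_{m+1})] )
          + \frac{\nu_{\mathcal{L}} \eta}{2 - \nu_{\mathcal{L}} \eta} d_m.    (33)

Proof. Since $\mathcal{L}(w)$ is $\nu_{\mathcal{L}}$-smooth in $w$
(Lemma 1), we can get

    \mathcal{L}(w_{m+1}) \le \mathcal{L}(w_m) + \nabla \mathcal{L}(w_m)^T (w_{m+1} - w_m)
                             + \frac{\nu_{\mathcal{L}}}{2} \|w_{m+1} - w_m\|^2.

Taking expectation on both sides of the above inequality, we get

    E[\mathcal{L}(w_{m+1})] \le \mathcal{L}(w_m) - \eta \|\nabla \mathcal{L}(w_m)\|^2
                                + \frac{\nu_{\mathcal{L}} \eta^2}{2} E(\|p_{m,t}\|^2).

According to Lemma 6, we have

    E[\mathcal{L}(w_{m+1})] \le \mathcal{L}(w_m) - \eta \|\nabla \mathcal{L}(w_m)\|^2
                                + \frac{\nu_{\mathcal{L}} \eta^2}{2} ( d_m + \|\nabla \mathcal{L}(w_m)\|^2 ).

Then we have

    ( \eta - \frac{\nu_{\mathcal{L}} \eta^2}{2} ) \|\nabla \mathcal{L}(w_m)\|^2
      \le \mathcal{L}(w_m) - E[\mathcal{L}(w_{m+1})] + \frac{\nu_{\mathcal{L}} \eta^2}{2} d_m.    (34)

Choosing a small $\eta$ such that $\eta - \frac{\nu_{\mathcal{L}} \eta^2}{2} > 0$,
we have

    \|\nabla \mathcal{L}(w_m)\|^2
      \le \frac{2}{2\eta - \nu_{\mathcal{L}} \eta^2} ( \mathcal{L}(w_m) - E[\mathcal{L}(w_{m+1})] )
          + \frac{\nu_{\mathcal{L}} \eta}{2 - \nu_{\mathcal{L}} \eta} d_m.

□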

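Lemma 8. We have the following result:

    E(\|w_{m+1} - x\|^2) + 2 \eta \nabla \mathcal{L}(w_m)^T (w_m - x)
        + \frac{\eta}{1 - \nu_{\mathcal{L}} \eta / 2} ( E[\mathcal{L}(w_{m+1})] - \mathcal{L}(x) )
      \le \|w_m - x\|^2 + \frac{\eta}{1 - \nu_{\mathcal{L}} \eta / 2} ( \mathcal{L}(w_m) - \mathcal{L}(x) )
          + \frac{2 \eta^2}{2 - \nu_{\mathcal{L}} \eta} d_m.

Proof. From (18), we can get $\forall x$,

    \|w_{m+1} - x\|^2 \le \|w_m - x\|^2 - 2 \eta p_{m,t}^T (w_m - x) + \eta^2 \|p_{m,t}\|^2.

Then, we have

    E(\|w_{m+1} - x\|^2) \le \|w_m - x\|^2 - 2 \eta \nabla \mathcal{L}(w_m)^T (w_m - x) + \eta^2 E(\|p_{m,t}\|^2).

According to Lemma 6 and Lemma 7, we have

    E(\|w_{m+1} - x\|^2) + 2 \eta \nabla \mathcal{L}(w_m)^T (w_m - x)
      \le \|w_m - x\|^2 + \eta^2 d_m + \eta^2 \|\nabla \mathcal{L}(w_m)\|^2
      \le \|w_m - x\|^2 + \eta^2 d_m
          + \frac{\eta}{1 - \nu_{\mathcal{L}} \eta / 2} ( \mathcal{L}(w_m) - E[\mathcal{L}(w_{m+1})] )
          + \frac{\nu_{\mathcal{L}} \eta^3}{2 - \nu_{\mathcal{L}} \eta} d_m.

Then, we can get

    E(\|w_{m+1} - x\|^2) + 2 \eta \nabla \mathcal{L}(w_m)^T (w_m - x)
        + \frac{\eta}{1 - \nu_{\mathcal{L}} \eta / 2} ( E[\mathcal{L}(w_{m+1})] - \mathcal{L}(x) )
      \le \|w_m - x\|^2 + \frac{\eta}{1 - \nu_{\mathcal{L}} \eta / 2} ( \mathcal{L}(w_m) - \mathcal{L}(x) )
          + \frac{2 \eta^2}{2 - \nu_{\mathcal{L}} \eta} d_m.

□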

Lemma 9. Let $\lambda_1$ denote the maximum eigenvalue of $A^T A$.
$\forall w, x, y, \beta$,

    \mathcal{L}(w) - \mathcal{L}(x) \ge f(w) - f(x) + (A^T b)^T (w - x) - \frac{\rho \lambda_1}{2} \|w - x\|^2,

where $b = \beta + \rho (Aw + By - c)$.

Proof. According to the definition
$\mathcal{L}(w) = f(w) + g(y) + \beta^T (Aw + By - c) + \frac{\rho}{2} \|Aw + By - c\|^2$,
we have

    \mathcal{L}(w) - \mathcal{L}(x)
      = f(w) - f(x) + \beta^T A (w - x) + \frac{\rho}{2} \|Aw + By - c\|^2 - \frac{\rho}{2} \|Ax + By - c\|^2
      = f(w) - f(x) + \beta^T A (w - x) + \frac{\rho}{2} [ \|Aw\|^2 - \|Ax\|^2 + 2 (By - c)^T A (w - x) ]
      = f(w) - f(x) + (A^T (\beta + \rho (By - c)))^T (w - x) + \frac{\rho}{2} ( \|Aw\|^2 - \|Ax\|^2 )
      = f(w) - f(x) + (A^T (\beta + \rho (Aw + By - c)))^T (w - x) - \frac{\rho}{2} \|Aw - Ax\|^2
      = f(w) - f(x) + (A^T b)^T (w - x) - \frac{\rho}{2} \|Aw - Ax\|^2
      \ge f(w) - f(x) + (A^T b)^T (w - x) - \frac{\rho \lambda_1}{2} \|w - x\|^2.

□


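Lemma 10. If $f(w)$ is strongly convex and $\mu_f > \rho \lambda_1$, we
have the following result:

    E[ f(x_{t+1}) - f(x) + (A^T \alpha_{t+1})^T (x_{t+1} - x) ]
      \le \frac{\mu_f}{4} ( \|x_t - x\|^2 - E \|x_{t+1} - x\|^2 ),     (35)

where $\alpha_{t+1}$ is the same as that in Lemma 3.

Proof. Note that

    r = 2\eta - \frac{\eta}{1 - \nu_{\mathcal{L}} \eta / 2},   s = \frac{\eta}{1 - \nu_{\mathcal{L}} \eta / 2}.

Since $f(w)$ is strongly convex in $w$, we can prove that
$\mathcal{L}(w)$ is also strongly convex in $w$. Then we have

    \nabla \mathcal{L}(w_m)^T (w_m - x) \ge \mathcal{L}(w_m) - \mathcal{L}(x) + \frac{\mu_f}{2} \|w_m - x\|^2.    (36)

By combining the results in Lemma 8 and (36), we have

    E \|w_{m+1} - x\|^2 + \frac{\mu_f s}{2} \|w_m - x\|^2 + r \nabla \mathcal{L}(w_m)^T (w_m - x)
        + s ( E[\mathcal{L}(w_{m+1})] - \mathcal{L}(x) )
      \le \|w_m - x\|^2 + \frac{2 \eta^2}{2 - \nu_{\mathcal{L}} \eta} d_m.

For convenience, we use

    D_m = f(w_m) - f(x) + (A^T b_{m,t})^T (w_m - x),

where the definition of $b_{m,t}$ is in (16).
And we also have

    \nabla \mathcal{L}(w_m) = \nabla f(w_m) + A^T b_{m,t},
    \nabla f(w_m)^T (w_m - x) \ge f(w_m) - f(x) + \frac{\mu_f}{2} \|w_m - x\|^2.

Then according to Lemma 6 and Lemma 9, we can get

    ( 1 - \frac{\rho s \lambda_1}{2} - \frac{\mu_f s}{4} ) E(\|w_{m+1} - x\|^2) + \frac{\mu_f s}{2} \|w_m - x\|^2
        + r ( D_m + \frac{\mu_f}{2} \|w_m - x\|^2 ) + s E( D_{m+1} + \frac{\mu_f}{4} \|w_{m+1} - x\|^2 )
      \le \|w_m - x\|^2 + \frac{4 \nu_{\mathcal{L}}^2 \eta^2}{2 - \nu_{\mathcal{L}} \eta} \|w_m - x\|^2
          + \frac{4 \nu_{\mathcal{L}}^2 \eta^2}{2 - \nu_{\mathcal{L}} \eta} \|w_0 - x\|^2,

i.e.,

    ( 1 - \frac{\rho s \lambda_1}{2} - \frac{\mu_f s}{4} ) E(\|w_{m+1} - x\|^2)
        + r ( D_m + \frac{\mu_f}{4} \|w_m - x\|^2 ) + s E( D_{m+1} + \frac{\mu_f}{4} \|w_{m+1} - x\|^2 )
      \le ( 1 + \frac{4 \nu_{\mathcal{L}}^2 \eta^2}{2 - \nu_{\mathcal{L}} \eta} - \frac{\mu_f r}{4} - \frac{\mu_f s}{2} ) \|w_m - x\|^2
          + \frac{4 \nu_{\mathcal{L}}^2 \eta^2}{2 - \nu_{\mathcal{L}} \eta} \|w_0 - x\|^2.

Here, $\eta$ needs to satisfy the following condition:

    1 + \frac{4 \nu_{\mathcal{L}}^2 \eta^2}{2 - \nu_{\mathcal{L}} \eta} - \frac{\mu_f r}{4} - \frac{\mu_f s}{2}
      \le 1 - \frac{\rho s \lambda_1}{2} - \frac{\mu_f s}{4},

i.e.,

    ( 4 \nu_{\mathcal{L}}^2 + \frac{\mu_f \nu_{\mathcal{L}}}{2} ) \eta + \rho \lambda_1 \le \mu_f.

Let $\alpha = 1 - \frac{\rho s \lambda_1}{2} - \frac{\mu_f s}{4}$. We have

    \alpha E \|w_{m+1} - x\|^2 + r E( D_m + \frac{\mu_f}{4} \|w_m - x\|^2 ) + s E( D_{m+1} + \frac{\mu_f}{4} \|w_{m+1} - x\|^2 )
      \le \alpha E \|w_m - x\|^2 + \frac{4 \nu_{\mathcal{L}}^2 \eta^2}{2 - \nu_{\mathcal{L}} \eta} \|w_0 - x\|^2.

Note that $r + s = 2\eta$, and $D_m$ is convex in $w_m$. We take

    \bar{w}_{m+1} = \frac{1}{2\eta} ( r w_m + s w_{m+1} ),

which is a convex combination of $w_m$ and $w_{m+1}$. Then we have

    \alpha E \|w_{m+1} - x\|^2 + 2 \eta E( \bar{D}_{m+1} + \frac{\mu_f}{4} \|\bar{w}_{m+1} - x\|^2 )
      \le \alpha E \|w_m - x\|^2 + \frac{4 \nu_{\mathcal{L}}^2 \eta^2}{2 - \nu_{\mathcal{L}} \eta} \|w_0 - x\|^2,    (37)

where $\bar{D}_{m+1} = f(\bar{w}_{m+1}) - f(x) + (A^T \bar{b}_{m+1,t})^T (\bar{w}_{m+1} - x)$,
and $\bar{b}_{m+1,t} = \beta_t + \rho (A \bar{w}_{m+1} + B y_t - c)$.

Summing up (37) from $m = 0$ to $M - 1$, and taking
$x_{t+1} = \frac{1}{M} \sum_{m=1}^{M} \bar{w}_m$, we have

    2 M \eta E( f(x_{t+1}) - f(x) + (A^T \alpha_{t+1})^T (x_{t+1} - x) + \frac{\mu_f}{4} \|x_{t+1} - x\|^2 )
      \le ( \alpha + \frac{4 M \nu_{\mathcal{L}}^2 \eta^2}{2 - \nu_{\mathcal{L}} \eta} ) \|w_0 - x\|^2,

where $\alpha_{t+1} = \beta_t + \rho (A x_{t+1} + B y_t - c)$.

Then, we have

    E[ f(x_{t+1}) - f(x) + (A^T \alpha_{t+1})^T (x_{t+1} - x) ]
      \le ( \frac{\alpha}{2 M \eta} + \frac{2 \nu_{\mathcal{L}}^2 \eta}{2 - \nu_{\mathcal{L}} \eta} ) \|x_t - x\|^2
          - \frac{\mu_f}{4} E(\|x_{t+1} - x\|^2)
      \le \frac{\mu_f}{4} ( \|x_t - x\|^2 - E(\|x_{t+1} - x\|^2) ),

where we assume that
$\frac{\alpha}{2 M \eta} + \frac{2 \nu_{\mathcal{L}}^2 \eta}{2 - \nu_{\mathcal{L}} \eta} \le \frac{\mu_f}{4}$.

□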


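E Proof of Theorem 2

Proof. Let $u = [x; y; \alpha]$, $u_t = [x_t; y_t; \alpha_t]$,
$\bar{u}_T = \frac{1}{T} \sum_{t=1}^{T} u_t$, and
$F(u) = [A^T \alpha;\ B^T \alpha;\ -(Ax + By - c)]$.

Summing up the inequalities in (35), (26) and (27), we have:

    E[ P(x_{t+1}, y_{t+1}) - P(x, y) + F(u_{t+1})^T (u_{t+1} - u) ]
      \le \frac{\mu_f}{4} ( \|x_t - x\|^2 - E \|x_{t+1} - x\|^2 )
          + \frac{\rho}{2} E[ \|y_t - y\|_H^2 - \|y_{t+1} - y\|_H^2 - \|y_t - y_{t+1}\|_H^2 ]
          + \frac{1}{2\rho} E[ \|\beta_t - \alpha\|^2 - \|\beta_{t+1} - \alpha\|^2 ]
          + \frac{\rho}{2} E \|y_t - y_{t+1}\|_H^2.

It is easy to prove that $P(x_t, y_t) - P(x, y) + F(u_t)^T (u_t - u)$ is
convex in $(x_t, y_t)$. Moreover, we have
$\bar{x}_T = \frac{1}{T} \sum_{t=1}^{T} x_t$ and
$\bar{y}_T = \frac{1}{T} \sum_{t=1}^{T} y_t$. By using Jensen's
inequality, we have:

    P(\bar{x}_T, \bar{y}_T) - P(x, y) + F(\bar{u}_T)^T (\bar{u}_T - u)
      \le \frac{1}{T} \sum_{t=1}^{T} [ P(x_t, y_t) - P(x, y) + F(u_t)^T (u_t - u) ].

Then, we have:

    E[ P(\bar{x}_T, \bar{y}_T) - P(x, y) + F(\bar{u}_T)^T (\bar{u}_T - u) ]
      \le \frac{1}{T} \sum_{t=0}^{T-1} E[ P(x_{t+1}, y_{t+1}) - P(x, y) + F(u_{t+1})^T (u_{t+1} - u) ]
      \le \frac{\mu_f}{4T} \|x_0 - x\|^2 + \frac{\rho}{2T} \|y_0 - y\|_H^2 + \frac{1}{2 \rho T} \|\beta_0 - \alpha\|^2.    (38)

The result in (38) is satisfied for any $(x, y, \alpha)$. In particular,
if we take $x = x_*$, $y = y_*$ and
$\alpha = \gamma \frac{A \bar{x}_T + B \bar{y}_T - c}{\|A \bar{x}_T + B \bar{y}_T - c\|}$,
we have:

    E[ P(\bar{x}_T, \bar{y}_T) - P(x_*, y_*) + \gamma \|A \bar{x}_T + B \bar{y}_T - c\| ]
      \le \frac{\mu_f}{4T} \|x_0 - x_*\|^2 + \frac{\rho}{2T} \|y_0 - y_*\|_H^2
          + \frac{1}{\rho T} (\|\beta_0\|^2 + \gamma^2).

□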

Figure 1: Experiments on four datasets for general convex problems. Top:
objective value on training set; Bottom: testing loss. (Surviving panel
label: (e) a9a.)

Figure 2: Experiments on four datasets for strongly convex problems. Top:
objective value on training set; Bottom: testing loss. (Surviving panel
labels: (a) a9a, (b) covertype, (f) covertype.)